Introduction

Column

Abstract

Wine Quality analysis is a statistical study of features affecting the quality of wine. In the Wine Quality data set there are about 11 features that affect it’s quality. The quality is measured on a scale of 0-10, this work envisons to study the specific features that play a key role in determining the quality of wine by two different statistical tools - Multiple regression and Logistic regressions. The box plots of features and their corresponding effects on Quality is shown for better and clear understanding. Due to some limitations in multiple regression such as collinearity and others, we have adapted Logistic regression. Finally as an out-of-the box step, we have tested the predicitability of this model to ensure if the model developed out of this data set can be used on some other data set.

Insight to features affecting the Wine Quality

  ï..FA   VA   CA  RS    CL FS TS      D   PH    S   A Q Y
1   7.4 0.70 0.00 1.9 0.076 11 34 0.9978 3.51 0.56 9.4 5 0
2   7.8 0.88 0.00 2.6 0.098 25 67 0.9968 3.20 0.68 9.8 5 0
3   7.8 0.76 0.04 2.3 0.092 15 54 0.9970 3.26 0.65 9.8 5 0
4  11.2 0.28 0.56 1.9 0.075 17 60 0.9980 3.16 0.58 9.8 6 1
5   7.4 0.70 0.00 1.9 0.076 11 34 0.9978 3.51 0.56 9.4 5 0
6   7.4 0.66 0.00 1.8 0.075 13 40 0.9978 3.51 0.56 9.4 5 0

The abbrevations used have the following meanings:
1- FA - Fixed Acidity
2- VA - Volatile Acidity
3- CA - Citric Acid
4- RS - Residual Sugar
5- CL - Chlorides
6- FS - Free Sulfur Dioxide
7- TS - Total Sulfur Dioxide
8- D - Density
9- PH - pH
10- S - Sulphates
11- A - Alcohol
12- Q - Quality

Column

Motivation

Wine, once an expensive good is now increasingly enjoyed by variety of consumers. In fact, Portugal is one of the top ten wine exporting country with about 32% of the market share in 2005 [5]. It’s export has increased to about 36% in 2007. Therefore new technologies has been adapted to enhance the making and selling of this wine. In this process there are two major steps: Wine Certification and Quality Assessment.

While Certification ensures the prevention of illegal adulteration, Quality evaluation which is a part of certification process is an indicator that is used for improving the wine making by identifying the most important features which thereby helps to classify wines as premium brands.

Generally, wine certification is done using phsiochemical and sensory tests, wherein the former is used to characterize wine based on density, alcohol or pH values while sensory test rely on human senses. Since the taste is the least understood of human senses [6] the relationships between these two tests are very difficult to understand, wine classification becomes an onerous task.

In such an atmosphere with the help of the technologies the data pertaining to this Wine Quality are collected and stored. These data contain important informations that explains trends and features on which the quality of wine depends. Based on this data and its associated information it is possible to improvise the quality by performing statistical analysis.

Therefore in this work, we have collected the data set pertaining to Wine Quality [4] on which we performed two types of stastical analysis: Multiple regression and Logistic regression. With these analysis we have extracted the important features that affect the wine quality and validated it with measures of “Goodness of fit”. At the end as an out-of-the box initiative we performed a prediction on this data set besides classification by developing a model, ensure that this model can be used on some other data sets too.

Methods

In the previous sections a brief idea about the Wine quality data set was shown, while in this section this raw data needs to be analyzed. Since there are 11 independant variables with Quality being the response variable, there are two basic approaches that comes to our mind:

  1. Multiple Regression - This is one of the basic regression models that can be applied to find out the nature of relationship between the dependant and the independant variable. It also helps us to determine the nature of relationship between the different variables in the data set.

  2. Logistic Regression - Due to shortcomings in Linear regression such as its inability to deal with categorical variables, it will be better and ideal to use logistic regression. Along with this simple logistic regression, combining prediction based analysis of this model will provide a tangible conclusion.

But before performing any analysis, we need to determine the effect of features on the quality through box plot in the next section. With this idea we will first implement Mutliple followed by Logistic regressions.

Histogram

Data Exploration

Column

Fig.1: Effect of RS

Fig.2: Effect of FS

Column

Fig.3: Effect of CA

Fig.4: Effect of VA

Column

Fig.5: Effect of Chlorides

Fig.6: Effect of TS

Column

Fig.7: Effect of FA

Fig.8: Effect of pH

Column

Fig.9: Effect of Density

Fig.10: Effect of Alcohol

column

Fig.11 Effect of Sulphates

Fig.12 Histogram of Quality

Diagnostics

Column

Pair-wise relationship of independant variables

In this figure a relationships between the dependant variables are shown. This shows the effect of one feature in presence of another, whether these are positively or negatively correlated or if they are not at all related.

Column

Crazy diagnostics (\(\color{red}{\text{This is not correct way!!!!!}}\))

Despite knowing the fact that this model violates the conditions of model adequacy, for the sake of knowing more diagnostic of this data set using the simple linear regressions was done and the following plots were done.

These plots totally do not make any sense. It is totally impossible to draw any conclusions from such diagnostics. The main reason for such improper plots is due to the consideration of the response variable. In my view I feel since the response variable is categorical in nature the linear or multiple regression analysis will not work here.

Multiple Regression

Column

Model Adequacy Checking

Before the application of the regression analysis, it is pertinent to check if the following conditions are satisfied or not. Violation of these conditions would result in unstable models.[7]

  1. Linearity Assumption
  2. Zero Mean Assumption
  3. Independant Assumption
  4. Equal Varience Assumption
  5. Normality Assumption

Test for Normality

This test is conducted mainly to examine the normality condition of any given data set. The null hypothesis of this test is that the population is normally distributed.

In this case, our p-value is 2.2e-16 < 0.05 = \(\alpha\) the level of significance. Reject the null hypothesis. This shows that this data set is not normally distributed. Therefore the normality assumption is violated.


    Shapiro-Wilk normality test

data:  Z$Q
W = 0.85759, p-value < 2.2e-16

column

Collinearity

Collinearity, which is defined as the dependance relation between the independant variables is an important factor that needs to be considered. This might in fact sabotage the model by producing low and undesirable results.

There are two different methods to examine the collinearity in this case:

  1. From pair-wise relationship plot

  2. Varience Inflation factor (VIF)

Inference from pair-wise relationship plot

From sheer obervation of the pair-wise plot we can see a positive correlation between pH and volatile acidity (VA); a negative correlation between fixed acidity and pH. Similarly, the citric acid, acidity, and pH are all correlated as they together determine the acidity.

Varience Inflation Factor

VIF gives an excat values which aids us to conclude about the collinearity. Any VIF values greater than 2 can be considered to be suffering from collinearity. Since this model suffers from collinearity as well as fails to satisfy the normality assumption, multiple regression cannot be used to this data set

   ï..FA       VA       CA       RS       CL       FS       TS        D 
2.887544 1.318209 1.710269 1.335600 1.198213 1.439199 1.494775 2.569294 
      PH        S        A        Q 
1.831266 1.186651 1.665065 1.098369 

Logistic Regression

Row

Logistic Regression Analysis

This table is similar to the regression analysis obtained using linear regression. This table comprises of all the variables that are both significant and non-significant. By removing the non-significant variables, we can determine the features affecting the Wine Quality.


Call:
glm(formula = Y ~ ., family = binomial(link = "logit"), data = df)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-3.4025  -0.8387   0.3105   0.8300   2.3142  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)  42.949948  79.473979   0.540  0.58890    
ï..FA         0.135980   0.098483   1.381  0.16736    
VA           -3.281694   0.488214  -6.722 1.79e-11 ***
CA           -1.274347   0.562730  -2.265  0.02354 *  
RS            0.055326   0.053770   1.029  0.30351    
CL           -3.915713   1.569298  -2.495  0.01259 *  
FS            0.022220   0.008236   2.698  0.00698 ** 
TS           -0.016394   0.002882  -5.688 1.29e-08 ***
D           -50.932385  81.148745  -0.628  0.53024    
PH           -0.380608   0.720203  -0.528  0.59717    
S             2.795107   0.452184   6.181 6.36e-10 ***
A             0.866822   0.104190   8.320  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2209.0  on 1598  degrees of freedom
Residual deviance: 1655.6  on 1587  degrees of freedom
AIC: 1679.6

Number of Fisher Scoring iterations: 4

Row

Discussion

The summary of the statiscal output to model Wine Quality with the other variables is the shown in adjacent block. This summary comprises of all the variables, their p-values and their significance. By performing simple t-tests we can find out the variables that contributes significantly and does not. Based on this decision, the values that do not contribute significantly are removed and the corresponding Goodness of fit is computed. One sample t-test is shown below.

t-test sample

This same method shown below was used to check all the other variables in this model. \(\beta_{1}\) : Change in the Quality for a unit percent change in the fixed acidity.

\(H_{0}\) : \(\beta_{1}\) = 0 versus \(H_{1}\): \(\beta_{1}\) \(\neq\) 0.

The p-value is 0.16736 > 0.05 = \(\alpha\) the level of significance. Failed to reject \(H_{0}\). Therefore the fixed acidity does not contribute significantly to the model provided the other variables are present in the model.

In this way, the non-significant variables are identified (FA,RS,D,pH) and removed. Therefore the rest of the variables except for these 4 contribute the most to the given model as per the summary table. Now a new model based on these significant variables is developed, Goodness of fit, sensitivity, specificity, concordance, ROC, predictability is determined.

Goodness of fit

There is a famous quote \(\color{red}{\text{All Models are wrong but some are useful}}\). Since there are no good models, we can find out models that fits a given set of observations. To find and conclude how well a particular fits a set of observations \(\color{red}{\text{Goodness of fit}}\) is used.

Measures of Goodness of fit

There are several measures of fit available that can be used for computation purposes. These measures are sometimes classified into \(\color{red}{\text{Global}}\) and \(\color{red}{\text{Logical}}\). They are:
• Chi-square goodness of fit tests and deviance
• Hosmer-Lemeshow tests
• Classification tables
• ROC curves
• Logistic regression R2
• Model validation via an outside data set or by splitting a data set

Analysis

Column

Measure of Goodness of fit

Out of the several measures for goodness of fit mentioned we chose to do Chi-square, ROC curves, McFadden and Prediction by data splitting. Besides this, the sensitivity, specificity, concordance and accuracy too will be measured.

McFadden Pseduo \(R^{2}\)

The McFadden Pseduo \(R^{2}\), this can be mathematically modelled as, (1- \(\frac{\ln{lm_{1}}}{\ln{lm_{0}}}\)); where \(\ln{lm_{1}}\) is the log likelihood of the fitted model and \(\ln{lm_{0}}\) is that of the null model. Typically, the idea value of this McFadden \(R^{2}\) lies between 0 and 1, the best or ideal possible will be about 0.40 as it is very hard to get higher value.

 McFadden 
0.2494976 

Likelihood Ratio Test

This test is used to find out how much our model as improved. Adding lots of predictors will improve the model. Here our old model ‘model’ and new model ‘fit’ reduced the deviations by over 1000. It is a known fact that the goal of logistic regression is to minimize the deviance residuals.

Analysis of Deviance Table

Model: binomial, link: logit

Response: Y

Terms added sequentially (first to last)


      Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
NULL                   1598     2209.0              
ï..FA  1   14.613      1597     2194.4 0.0001320 ***
VA     1  161.169      1596     2033.2 < 2.2e-16 ***
CA     1    4.325      1595     2028.9 0.0375457 *  
CL     1   11.952      1594     2016.9 0.0005458 ***
FS     1    6.470      1593     2010.4 0.0109730 *  
TS     1   87.094      1592     1923.3 < 2.2e-16 ***
S      1   79.579      1591     1843.8 < 2.2e-16 ***
A      1  185.930      1590     1657.8 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Column

###ROC

Specificity, Sensitivity and Concordance

Sensitivity is the number of 1’s (actuals) predicted by the model while specificity is the number of 0’s (actuals) predicted by the model. This particular data set has a sensitivity of about 71% and specificity of about 79%.

Concordance is the percentage of (1-0 pairs) whose actual positives are greater than the negatives. In general higher the concordance the better the model. This model has a concordance of about 82%.

[1] 0.7069892
[1] 0.7871345
$Concordance
[1] 0.8217789

$Discordance
[1] 0.1782211

$Tied
[1] 5.551115e-17

$Pairs
[1] 636120

Accuracy

Accuracy is defined as the (538+662)/1500 = 80%
    0   1
0 526 182
1 218 673

Prediction

Column

Significance of Prediction

One of the major measure for Goodness of fit is prediction. Spilting the data set into training and test, developing a model based on the training data set and validating it using the test data is one of the best measures accepted. In this work as a final step, we have done the same.

Out of the 1500 observations, 80% of the data that is about 1200 is used for the training while the remaining 20% that is about 300 is used for as test data. In summary:

  1. Spilting the data set into training and test

  2. Developing a model with the training data set

  3. Apply the logistic regression

  4. Calculate the Goodness of fit

  5. Sensitivity, Specificity, Accuracy, Concordance

ROC and Area under the curve

This is the key factor that decides the predictability of the model. Greater the Area under the curve greater will be its predictability. The main idea behind ROC is its ability to trace the true positives when the prediction probability cut off is reduced from 1 to 0. Here the AuC is about 80% which is pretty much good.

column

Goodness of fit

As mentioned in the previous section, there are several measures of Goodness of fit.

         llh      llhNull           G2     McFadden         r2ML 
-621.3861644 -829.6153188  416.4583087    0.2509948    0.2932290 
        r2CU 
   0.3914429 

Sensitivity, Specificity and Concordance

This model developed has a sensitivity of 81%, specificity of about 70% and concordance of about 82%.

[1] 0.8493151
[1] 0.6722222
$Concordance
[1] 0.8207002

$Discordance
[1] 0.1792998

$Tied
[1] -2.775558e-17

$Pairs
[1] 39420

Discussions

column

Comparisons

Wine Quality analysis has been done by various authors. There are 3 different works performed on the same data set. In [1], the different combinations (bi-variate and multivariate) analysis has been done. In that analysis, pair of variables and their effect on the overall quality was examined. In [2], on the same data set different types of regressions were performed: Linear, polynomial, Multiple, Logistic and prediction analysis. Here a maximum accuracy of about 70 % was obtained, while in our work we have achieved about 80 % accuracy. While in [3], on the entire data set a linear modelling was done. Different measures such as \(R^{2}\) were shown. These values seem to be comparitively lower compared to those obtained and shown by 1.

column

References

[1] https://rpubs.com/prasad_pagade/wine_quality_prediction

[2] http://rstudio-pubs-static.s3.amazonaws.com/438329_edfaab4011ce44a59fb9ae2d216d8dea.html

[3] https://www.kaggle.com/sagarnildass/red-wine-analysis-by-r

[4] https://archive.ics.uci.edu/ml/index.php

[5] P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. “Modeling wine preferences by data mining from physicochemical properties”

[6] D. Smith and R. Margolskee “Making sense of taste. Scientific American,Special issue”.

[7] Dr. Chen Notes Chapter 4 “Model Adequacy Checking”.